#draft - rewrite it later
Hi dear scientist! We are glad that you have decided to set out on an adventure in computational biology. So that you don’t get lost, here are some tips and tricks for using the cluster, along with the best practices we follow to keep our work organized.
We believe that standardizing work processes improves reproducibility in data science and makes it much easier to collaborate on projects.
In the next hour, you will learn how to keep your data organized and structured, so that if you accidentally die, your colleagues can continue working on your project without delay.
So, you want to connect to the cluster to run RStudio or process your NGS data. How do you do that? Let’s set up your machine first.
Here we will cover the basics: setting up the IDE, connecting to a server, basic Linux commands, and some tips on how to keep your project organized.
There are different ways to access and work on an HPC cluster. For instance, macOS comes with Terminal pre-installed, and iTerm2 is another very popular option. For Windows users, there are Windows Terminal and MobaXterm. Luckily, it doesn’t matter much whether you have Windows or Mac: we recommend VS Code as the simplest and most reliable tool.
The first step is to download VS Code, install it, and run it. It will look something like this:
On the left panel you can see several tabs:
Explorer - lets you navigate your workspace and browse files and folders
Search - searches across documents
Source Control - gives you control over a git repository. We will cover that later
Run and Debug - lets you run and debug code. We don’t need it for now
Extensions - the main beauty of VS Code. Lets you install modules that extend the capabilities of the IDE
Profile - allows you to connect your GitHub account and synchronize settings across different devices
Settings - settings and more ;)
If VS Code is new to you, we recommend the walkthroughs that VS Code itself offers: “Get Started with VS Code” and “Learn the Fundamentals”. The official guides from Microsoft are also really good.
Before we begin, we need to make a couple of adjustments to VS Code so that it works well with a cluster. This is important, please don’t skip this step.
First, we will deactivate FileWatcher, a VS Code component that constantly checks whether files have changed in an open directory. This is convenient, but if you work in a folder that contains many files, it can put a heavy load on the CPU. To do this, follow these steps:
Second, we will deactivate TypeScript and JavaScript Language Features support, which can sometimes load the CPU as well. Follow these steps:
Extensions > @builtin TypeScript and JavaScript > Disable > Reload
Now you need to install the “Remote - SSH” extension from the “Extensions” tab. Also, before connecting to the server, make sure that your account is set up and your Isilon storage is mounted. To arrange that, contact BICU and we will help you. OK, now you are ready to connect to the cluster. Follow these simple steps:
Click the “><” symbol in the lower left corner > Connect to Host > + Add New SSH Host… Or select a host that you have already set up.
Select the location where you want to store the config (the default is fine)
Then type in “ssh your_user_name@machine_ip”, where your_user_name is the name that you got from us and machine_ip is one of the IP addresses below:
| Machine Name | Machine Linux Name | IP address |
| --- | --- | --- |
| Biodirt | 011SV155.AD.CCRI.AT | 10.5.1.155 |
| Biohazard | 011SV157.AD.CCRI.AT | 10.5.1.157 |
| Biowaste | 011SV154.AD.CCRI.AT | 10.5.1.154 |
Enter your password (if it’s the first time you log in to your account, then you have a generic password that you must change asap) and hit enter. You might be asked if you trust the connection, or if you want to
It doesn’t look like much happened, but you are on the server. Now, let’s go to the next step and learn some basic Linux commands.
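For later reference, the host you added is stored as a plain-text entry in the SSH config file you selected. The sketch below only prints what such an entry could look like; it does not modify any file. The alias biodirt and the variable names are assumptions made for illustration, and your_user_name is the same placeholder as above.

```shell
# Hypothetical values: replace your_user_name with the account name you received.
ssh_user="your_user_name"
host_alias="biodirt"        # any short name you like (an assumption of this sketch)
host_ip="10.5.1.155"        # Biodirt, from the table above

# Build the entry and print it; it could be pasted into ~/.ssh/config.
config_entry="Host $host_alias
    HostName $host_ip
    User $ssh_user"
printf '%s\n' "$config_entry"

# With such an entry in place, both of these would open the same connection:
#   ssh your_user_name@10.5.1.155
#   ssh biodirt
```

A short alias is mainly a convenience: it also shows up by name in the VS Code “Connect to Host” list.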
Our server is a Linux machine, and working with it involves a terminal and the Bash scripting language. If you have experience with Unix systems, feel free to skip this section, as it covers very basic concepts. If you have never worked with a terminal before, it can seem overwhelming and awkward at first, but you will get used to it and soon see how quickly and easily you can do many things with Bash. Now, let’s learn the commands that are absolutely essential. To open a terminal, press Ctrl+Shift+` or choose Terminal > New Terminal in the top menu bar.
First, let’s check where we are now, by typing
pwd
The output will look like this:
/home/test_user
Here / is the root directory,
/home is the folder where the folders of all users are stored,
and /your_user_name (here /test_user) is your home folder.
Now, let’s check what we have in our folder by typing
ls
There isn’t much in our folder yet, but you should
have Isilon storage mounted, so you should see it in the output:
bioinf_isilon
But this is not everything, there are also hidden files. If you
want to see them, try:
ls -a
You will see a bunch of files like these: .bashrc, .profile,
.bash_logout, etc. Files starting with a dot are hidden by
default.
When you have a lot of files, it becomes more practical to
display them as a detailed list. To do that use
ls -lh
-l stands for the “long” listing format (one file per line, with details) and -h for
“human-readable”, so file sizes are shown in K, M, or G rather than in bytes,
which is easier to read.
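To see these options side by side without touching your real files, here is a small sketch that works in a throwaway directory. The temporary directory and the file names (notes.txt, .hidden_settings) are made up for illustration.

```shell
# Work in a temporary directory (an assumption of this sketch, not part of the cluster setup).
demo_dir="$(mktemp -d)"
cd "$demo_dir"

# Create one regular file and one hidden file (its name starts with a dot).
touch notes.txt .hidden_settings

ls        # shows only notes.txt
ls -a     # also shows .hidden_settings, plus the entries . and ..
ls -lah   # options can be combined: long format, all files, human-readable sizes
```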
Ok, now we can peek at what we have in the bioinf_isilon
directory:
ls bioinf_isilon/
We will see several folders:
core_bioinformatics_unit Labdia _OLD-TEST Research zArchive zClipboard zrawdata
Great, now let’s learn how to move around. Start typing
cd and press “Tab” so that the terminal
autocompletes and shows you what folders are out there. So, try to
move to the folder of your group, or anywhere really:
cd bioinf_isilon/Research/YOUR_GROUP/Public
If you want to move to the folder above, you can use
cd ../
To return to the previous location, use cd -
To come back to your home directory, just type
cd ~
One important thing to remember is that, unlike Windows, Linux
is case-sensitive, so Research, research,
RESEARCH and ReSeArCh are all different
names.
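The navigation commands above can be tried safely in a small practice tree. This is only a sketch: the temporary directory and the folder name my_group are hypothetical, and the last step assumes a case-sensitive filesystem such as the one on the cluster.

```shell
# Build a small practice tree in a temporary directory (hypothetical names).
practice_dir="$(mktemp -d)"
mkdir -p "$practice_dir/Research/my_group/Public"   # -p creates the nested folders in one step

cd "$practice_dir/Research/my_group/Public"
pwd          # prints the full path of the Public folder

cd ../       # one level up, into my_group
pwd

cd -         # back to the previous location (Public); cd prints where it lands
cd ../../    # two levels up, into Research

# Names are case-sensitive: this is a different folder from "Research".
mkdir "$practice_dir/research"
ls "$practice_dir"
```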
Great! Now we can move around and see what we have in different directories. Next, let’s try to create files and directories, and to rename, copy, move, and delete them.
Ok, assuming that you are in your home directory, let’s create a new folder, using this command:
mkdir test_folder
Check that the folder is there with ls.
One important note: it’s generally a good idea to avoid special characters and spaces
in the names of directories and files. They are allowed, but you
would need to quote or escape them, which makes everything much more
cumbersome. So, try to avoid them.
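A short sketch of why spaces are cumbersome, run in a temporary directory; all folder and file names here are made up for illustration.

```shell
demo_dir="$(mktemp -d)"   # temporary practice directory (an assumption of this sketch)
cd "$demo_dir"

# A name with a space must be quoted (or escaped) every time it is used.
mkdir "my results"
touch "my results/first run.txt"
ls my\ results            # escaping the space with a backslash also works

# Without quotes the shell sees two separate arguments and creates two folders.
mkdir my results2
ls                        # now shows three folders: "my", "my results" and "results2"

# An underscore avoids the problem entirely.
mkdir my_results
```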
It’s also possible to create several folders nested inside one another.
For this, you need the -p parameter. Let’s do it and move
there:
mkdir -p test_folder/another_test/just_one_more
cd test_folder/another_test/just_one_more
At the moment it feels a bit empty, so let’s create a file inside. You can
do it in several ways; for example, you can use this command:
touch file.txt
FYI, it’s also possible to do this using the GUI in VS Code. To do that, click Explorer in the left panel > Open Folder > OK. You will see a tree of files and the buttons “Create file”, “Create Folder” and “Refresh”.
Once you have created the file, let’s write something inside. There
are countless ways to do that, but assuming you have the
“Explorer” open, find the file, open it, and write something
inside.
To look at the content of a file, we typically use the following
commands:
less file.txt - lets you look through the whole file. Navigate
with the arrow keys and the space bar. To exit, press q
head file.txt - shows the first 10 lines, or the first n
lines with -n
tail file.txt - the same, but for the last lines
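As a quick illustration of head and tail with an explicit number of lines, the sketch below generates a 100-line practice file first. The file name numbers.txt and the temporary directory are assumptions made for the example.

```shell
# Create a 100-line practice file in a temporary directory (hypothetical file name).
demo_dir="$(mktemp -d)"
seq 1 100 > "$demo_dir/numbers.txt"

head "$demo_dir/numbers.txt"        # first 10 lines: 1 to 10
head -n 3 "$demo_dir/numbers.txt"   # first 3 lines: 1, 2, 3
tail -n 2 "$demo_dir/numbers.txt"   # last 2 lines: 99, 100
```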
Now, let’s try to copy files using:
cp file.txt ../
We specify first what we want to copy, then the destination. If you want
to give the copy a different name, just specify the new name at
the end:
cp file.txt ../file_copy.txt
Moving and renaming work very similarly:
mv ../file_copy.txt ../file_with_new_name.txt
Now, let’s try to remove the file:
rm file.txt
Important note: rm removes the file permanently. There
is no such thing as a “trash bin”. So, be very careful with what you are
removing.
Now let’s try to remove the folder that we created. First move to
the home directory. Then use rm with the -r
parameter:
rm -r test_folder
So, now we know how to create, rename, copy, and delete files and folders. Remember that you can also do all of this in the GUI, but it is often easier in the CLI. These are the most commonly used commands, the ones you really need to remember. You can find others in the cheat sheet.
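If you want to rehearse the whole sequence in one go, the sketch below chains the commands from this section. It runs inside a temporary directory (an assumption of this example) so that rm cannot touch anything you care about.

```shell
# Practice in a temporary directory so no real data is at risk (assumption of this sketch).
work_dir="$(mktemp -d)"
cd "$work_dir"

mkdir -p test_folder/another_test/just_one_more   # nested folders in one step
cd test_folder/another_test/just_one_more
touch file.txt                                    # create an empty file

cp file.txt ../file_copy.txt                      # copy under a new name, one level up
mv ../file_copy.txt ../file_with_new_name.txt     # rename the copy
rm file.txt                                       # delete the original (no trash bin!)

cd "$work_dir"
ls -R test_folder                                 # show what is left, recursively
rm -r test_folder                                 # remove the folder and everything in it
```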
There are also several nice tutorials some of which are here:
Link
#1
One of the most important practices in bioinformatics (and probably even more so in a wet lab) is making your work easily reproducible. It’s crucial to know which packages and which versions you are using, and to record this information properly. And since packages often have many dependencies, managing them properly keeps your project from turning into a spaghetti monster.
There are different solutions out there, and one of the most popular is the Conda package manager, or its sister Mamba. They let you install specific versions of packages and easily create virtual environments for a project when you need them. To install Mamba, run these commands from your home directory. They download and run the Miniforge installer, which includes mamba.
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh
Don’t forget to reload the terminal so that mamba gets activated.
After installation, you have a very convenient way to control what to install and how. You are now in your default (base) environment, which is a good place to keep the tools that you use on a daily basis. For specific projects, it often makes sense to create separate environments. You will see how that works in the following sections of the tutorial. But first, we will install a couple of packages that are absolute must-haves.
First, let’s install tmux. This is a terminal multiplexer that lets you create separate sessions that you can easily switch between and that keep running your commands even when you get disconnected from the cluster. To install it, use this simple line of code:
mamba install tmux
When the installation is complete, type tmux in the
terminal to start a session. Not much seems to have changed at first
glance, but we now have a separate tmux process running, and if we
start a pipeline that takes, let’s say, 5 hours to complete, it won’t
crash if we get disconnected from the server. We can also easily create
and switch between windows and panes, much like tabs in a web browser.
For instance, we can split the window into two panes. But first, a
small note about the controls in tmux. All commands in tmux are
triggered by a prefix key followed by a command
key. By default, tmux uses Ctrl+b as the prefix key
(often written as C-b). It can feel weird at the beginning,
but you will get used to it very fast.
So, if you want to split the screen into left and right panes, press
Ctrl+b, release, then press %. To switch
between the panes, use C-b <arrow key>. Great! This is
very convenient when you want to have several terminals open but keep
them nicely organized. Now let’s see the true beauty of
tmux.
Let’s run a very simple loop that just prints the date and time
every 3 seconds until we stop it:
while true; do date; sleep 3; done
Now, detach from the current tmux session using
C-b d. We are back in a regular terminal, so let’s reconnect to
the running session. To do that, we need to know the name of the
session. Let’s check it with:
tmux ls
You should see only one session with the name “0”. To connect to it
use:
tmux a -t 0
And boom, you are back! You can see that the session stayed
active: more dates have been printed (to stop the running process, press
Ctrl+C). It might not seem like anything
extraordinary, but by default a normal terminal stops the processes associated
with it when the SSH connection is closed. So, if you started your precious
script, went to make a coffee, and your computer went to sleep in the
meantime, the SSH connection breaks and your script gets stopped. tmux
keeps your processes running on the server regardless. Also, if you work on several
projects, it is very practical to have a separate session for each
individual project. It helps to keep everything organized.
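One way to keep a session per project is to give each session a name instead of relying on the default “0”. The sketch below starts a named session in the background, lists it, and stops it again. The session name ab001_rnaseq is a hypothetical example, and the guard simply skips the tmux commands on machines where tmux is not installed.

```shell
session_name="ab001_rnaseq"   # hypothetical session name, one per project

if command -v tmux >/dev/null 2>&1; then
    # Start a detached (-d) session with a name (-s) that runs the date loop.
    tmux new-session -d -s "$session_name" 'while true; do date; sleep 3; done'

    tmux ls                               # the session is listed under its name
    # tmux attach -t "$session_name"      # attach interactively; detach again with C-b d

    # Stop the session (and the loop inside it) when it is no longer needed.
    tmux kill-session -t "$session_name"
else
    echo "tmux is not installed in this environment"
fi
```

In daily work you would leave out the kill-session step and come back to the session later with tmux attach.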
And before we go to the next section, here are some more handy
commands:
C-b x - close the current pane
C-b " - split the pane into top and bottom
C-b w - show all windows
Also, here you can find some really good tutorials if you want to dive deeper into tmux:
So, now we are fully prepared to create our first project and see the best way to keep the work organized and trackable.
Now we can simulate a typical project and show a convenient way to organize your workflow. In this example, we will analyze RNA-seq data. You will see how to organize the folder structure, and how to manage versions of big data and of your code. You will see how a Conda environment can be beneficial, and we will show how to share this information with your potential collaborators. You will also see how to run RStudio in a container.
First, we need to create a folder for the project. We suggest creating it inside your research group folder, if you don’t have one yet. It makes life much easier for your colleagues if they decide to pick up your project when you leave, or to re-analyze some data after a long time. A typical path would look like this:
/home/USER/bioinf_isilon/Research/GROUP/USER/projects/ID000_YYYYMM_ProjectName
The name of the project folder contains the following info:
ID000 - your ID (for example, I use AB as an acronym of my name and surname), followed by a number in the format 001, 002, … Giving each project a unique index makes it easier to find later.
YYYYMM - the date as year, then month. The year goes first so that folders sort chronologically by name.
ProjectName - a short description of what this project is, e.g. Neuroblastoma_RNAseq
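As a small illustration of the naming scheme, a folder name can be assembled from its three parts in the shell. The initials, project number, and description below are hypothetical values; the date is taken from the current month.

```shell
# Hypothetical values for illustration: initials AB, project number 1.
initials="AB"
project_number=1
description="Neuroblastoma_RNAseq"

# %03d pads the number to three digits (001, 002, ...); date +%Y%m gives YYYYMM.
project_name="$(printf '%s%03d' "$initials" "$project_number")_$(date +%Y%m)_${description}"
echo "$project_name"
```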
A typical directory structure would look like this:
ID000_YYYYMM_ProjectName/
├─ RProject/ - folder where you will store your R project
│  ├─ Results/ - results that you will produce with R code
│  ├─ Misc/ - maybe you have some important metadata for R
├─ Data_Raw/ - here it's either raw data, or a symlink to it
│  ├─ 01_Sample/
│  ├─ 02_Sample/
├─ Data_processed/ - in some cases you can produce intermediate files
├─ Src/ - your scripts for data processing
│  ├─ 01_DataProcessingScript.sh
│  ├─ 02_AnotherDataProcessingStep.sh
├─ Misc/ - additional data, e.g. references, metadata, etc.
├─ .gitignore - part of your git repo
├─ README.md - it's nice to have a short description of a project
Of course, this structure is not a strict rule. It’s just an idea of how a project might look. Keep it clean and self-explanatory, or add a very good description of what is what. Make life easier for yourself and for the colleague who will use your code after you.
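A layout like this can be created with a single mkdir -p call. The sketch below does so in a temporary directory; the project name AB001_202401_Neuroblastoma_RNAseq and the README text are hypothetical, and on the cluster the root would be your group folder instead.

```shell
# Hypothetical project location; on the cluster this would be your group folder.
projects_root="$(mktemp -d)"
project_dir="$projects_root/AB001_202401_Neuroblastoma_RNAseq"

# Create the top-level folders and the R project subfolders in one command.
mkdir -p \
    "$project_dir/RProject/Results" \
    "$project_dir/RProject/Misc" \
    "$project_dir/Data_Raw" \
    "$project_dir/Data_processed" \
    "$project_dir/Src" \
    "$project_dir/Misc"

# Start with an empty .gitignore and a README that carries a one-line description.
touch "$project_dir/.gitignore"
echo "# AB001 Neuroblastoma RNA-seq (example project)" > "$project_dir/README.md"

ls -a "$project_dir"
```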
So, move to the directory of your group and try to create a project
folder, and inside it the Data_Raw, Src and
Misc folders. When this is done, let’s pull a test dataset.
For this, move to Src and create a file named
00_DataLoader.sh. We will work with an example dataset:
human RNA-seq data.
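Open the created file and add this code:
#!/bin/bash
# Stop at the first failing command so a broken download is never unpacked.
set -euo pipefail
# Download the example RNA-seq archive into Data_Raw.
wget http://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar -O ../Data_Raw/HBR_UHR_ERCC_ds_5pc.tar
# Unpack the archive next to it, then delete the archive.
tar -xvf ../Data_Raw/HBR_UHR_ERCC_ds_5pc.tar -C ../Data_Raw/
rm ../Data_Raw/HBR_UHR_ERCC_ds_5pc.tar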
Save the file, and execute it from the Src folder
with:
bash ./00_DataLoader.sh
2) Now that we have created the structure and populated our folder, it's a good time to introduce some basic ideas on how to keep track of the changes that you make to your code, data, and so on.
We will use two packages for this: git and dvc. Try to install them using mamba. They are general-purpose packages, so we will install them into our base environment. The command should look like this:
mamba install git dvc
when the installation is complete.